When Simple Exploration is Sample Efficient: Identifying Sufficient Conditions for Random Exploration to Yield PAC RL Algorithms
Efficient exploration is one of the key challenges for reinforcement learning
(RL) algorithms. Most traditional sample efficiency bounds require strategic
exploration. Recently, many deep RL algorithms with simple heuristic exploration strategies, which come with few formal guarantees, have achieved surprising success in many domains. These results raise an important question: how well do we understand exploration strategies such as ε-greedy, and what characterizes the difficulty of exploration in MDPs? In this work we propose problem-specific sample complexity bounds for learning with random-walk exploration that rely on several structural properties. We also link our theoretical results to empirical benchmark domains, illustrating whether our bound gives polynomial sample complexity in these domains and how that relates to empirical performance.
Comment: Appeared in The 14th European Workshop on Reinforcement Learning
(EWRL), 201
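To make the kind of simple exploration discussed here concrete, below is a minimal sketch of ε-greedy action selection inside a tabular Q-learning loop. The gym-style `env` interface and all hyperparameters are assumptions made for illustration, not details from the paper.

```python
import numpy as np

def epsilon_greedy(Q, s, eps, rng):
    """With probability eps take a uniformly random action (the 'simple
    exploration' studied here); otherwise act greedily w.r.t. Q."""
    return rng.integers(Q.shape[1]) if rng.random() < eps else int(np.argmax(Q[s]))

def q_learning(env, n_states, n_actions, episodes=500, eps=0.1, alpha=0.1, gamma=0.99):
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False          # assumed env interface
        while not done:
            a = epsilon_greedy(Q, s, eps, rng)
            s_next, r, done = env.step(a)     # assumed env interface
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```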
The Online Coupon-Collector Problem and Its Application to Lifelong Reinforcement Learning
Transferring knowledge across a sequence of related tasks is an important
challenge in reinforcement learning (RL). Despite much encouraging empirical
evidence, there has been little theoretical analysis. In this paper, we study a
class of lifelong RL problems: the agent solves a sequence of tasks modeled as
finite Markov decision processes (MDPs), each of which is from a finite set of
MDPs with the same state/action sets and different transition/reward functions.
Motivated by the need for cross-task exploration in lifelong learning, we
formulate a novel online coupon-collector problem and give an optimal
algorithm. This allows us to develop a new lifelong RL algorithm whose overall sample complexity over a sequence of tasks is much smaller than that of single-task learning, even if the sequence of tasks is generated by an adversary. Benefits of the algorithm are demonstrated in simulated problems, including a recently introduced human-robot interaction problem.
Comment: 13 pages
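For background on the coupon-collector framing (the paper's online variant is more involved), the classical result gives the expected number of uniform draws needed to see all $n$ distinct coupons:

$$\mathbb{E}[T] \;=\; n \sum_{i=1}^{n} \frac{1}{i} \;=\; n H_n \;\approx\; n \ln n + \gamma n,$$

where $H_n$ is the $n$-th harmonic number and $\gamma \approx 0.5772$ is the Euler–Mascheroni constant. In the lifelong RL setting the "coupons" correspond, roughly, to the distinct MDPs the agent must identify across the sequence of tasks.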
Latent Contextual Bandits and their Application to Personalized Recommendations for New Users
Personalized recommendations for new users, also known as the cold-start
problem, can be formulated as a contextual bandit problem. Existing contextual
bandit algorithms generally rely on features alone to capture user variability.
Such methods are inefficient in learning new users' interests. In this paper we
propose Latent Contextual Bandits. We consider both the benefit of leveraging a
set of learned latent user classes for new users, and how we can learn such
latent classes from prior users. We show that our approach achieves a better
regret bound than existing algorithms. We also demonstrate the benefit of our
approach using a large real-world dataset and a preliminary user study.
Comment: 25th International Joint Conference on Artificial Intelligence (IJCAI 2016)
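The abstract does not spell out the algorithm, but one natural way to use learned latent user classes for a new user is sketched below: keep a posterior over the classes and pick the arm that looks best under the current posterior. Everything here (Bernoulli rewards, the greedy arm choice, the form of the reward models) is an assumption made for illustration, not the paper's method.

```python
import numpy as np

class LatentClassBandit:
    """Hedged sketch: recommend arms to a new user by tracking a posterior over
    K latent user classes learned from prior users (Bernoulli rewards assumed)."""

    def __init__(self, class_reward_probs, rng=None):
        # class_reward_probs[k, a] = learned P(reward = 1 | class k, arm a)
        self.theta = np.asarray(class_reward_probs, dtype=float)
        self.log_post = np.full(self.theta.shape[0], -np.log(self.theta.shape[0]))
        self.rng = rng or np.random.default_rng()

    def choose_arm(self):
        post = np.exp(self.log_post - self.log_post.max())
        post /= post.sum()
        return int(np.argmax(post @ self.theta))   # arm with best posterior-mean reward

    def update(self, arm, reward):
        # Bayesian update of class beliefs from the observed binary reward.
        p = self.theta[:, arm]
        lik = p if reward else 1.0 - p
        self.log_post += np.log(np.clip(lik, 1e-12, None))
```

A Thompson-sampling variant would instead sample a class from the posterior before choosing an arm; the greedy choice above is just the simplest option.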
Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning
Recently, there has been significant progress in understanding reinforcement
learning in discounted infinite-horizon Markov decision processes (MDPs) by
deriving tight sample complexity bounds. However, in many real-world
applications, an interactive learning agent operates for a fixed or bounded
period of time, for example tutoring students for exams or handling customer
service requests. Such scenarios can often be better treated as episodic
fixed-horizon MDPs, for which only looser bounds on the sample complexity
exist. A natural notion of sample complexity in this setting is the number of
episodes required to guarantee a certain performance with high probability (PAC
guarantee). In this paper, we derive an upper PAC bound and a lower PAC bound that match up to log terms and an additional linear dependency on the number of states. The lower bound is the first of its kind for this setting. Our upper bound leverages Bernstein's inequality to improve on previous bounds for episodic finite-horizon MDPs, which have a worse dependency on the time horizon.
Comment: 28 pages, appeared in Neural Information Processing Systems (NIPS) 2015, updated version with fixed typos and modified Lemma 1 and Lemma C.
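For concreteness, the episode-based PAC notion referred to above can be stated as follows (a standard formulation in our notation, not copied from the paper): with probability at least $1-\delta$, the number of episodes in which the algorithm's policy is more than $\varepsilon$-suboptimal is polynomially bounded,

$$\Bigl|\bigl\{\, k \;:\; V^{*}(s_{k,1}) - V^{\pi_k}(s_{k,1}) > \varepsilon \,\bigr\}\Bigr| \;\le\; \mathrm{poly}\!\left(|S|,\, |A|,\, H,\, \tfrac{1}{\varepsilon},\, \ln\tfrac{1}{\delta}\right),$$

where $\pi_k$ is the policy followed in episode $k$, $s_{k,1}$ its start state, and $H$ the horizon.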
Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
In this paper we present a new way of predicting the performance of a
reinforcement learning policy given historical data that may have been
generated by a different policy. The ability to evaluate a policy from
historical data is important for applications where the deployment of a bad
policy can be dangerous or costly. We show empirically that our algorithm
produces estimates that often have orders of magnitude lower mean squared error
than existing methods; it makes more efficient use of the available data. Our new estimator is based on two advances: an extension of the doubly robust estimator (Jiang and Li, 2015), and a new way to mix between model-based estimates and importance-sampling-based estimates.
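The abstract builds on the sequential doubly robust estimator of Jiang and Li (2015); a minimal sketch of that per-trajectory building block (not the paper's full estimator, which additionally blends in purely model-based estimates) might look like this:

```python
def doubly_robust_value(traj, pi, mu, Q_hat, V_hat, gamma=1.0):
    """Per-trajectory doubly robust estimate of the evaluation policy's value.

    traj:        list of (state, action, reward) tuples logged under mu
    pi(a, s):    action probability under the evaluation policy
    mu(a, s):    action probability under the behavior policy
    Q_hat, V_hat: approximate model's action-value and state-value functions
    """
    v_dr = 0.0
    # Work backwards from the end of the trajectory.
    for (s, a, r) in reversed(traj):
        rho = pi(a, s) / mu(a, s)                       # per-step importance ratio
        v_dr = V_hat(s) + rho * (r + gamma * v_dr - Q_hat(s, a))
    return v_dr
```

Averaging `doubly_robust_value` over all logged trajectories gives the DR estimate of the evaluation policy's value; its error degrades gracefully when either the model or the importance weights are poor.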
Incentive Decision Processes
We consider Incentive Decision Processes, where a principal seeks to reduce
its costs due to another agent's behavior, by offering incentives to the agent
for alternate behavior. We focus on the case where a principal interacts with a
greedy agent whose preferences are hidden and static. Though IDPs can be directly modeled as partially observable Markov decision processes (POMDPs), we show that the IDP can be reduced, exactly or approximately, to a polynomially sized MDP: when this representation is approximate, we prove the resulting policy is boundedly optimal for the original IDP. Our empirical simulations demonstrate the performance benefit of our algorithms over simpler approaches, and also demonstrate that our approximate representation yields a significantly faster algorithm whose performance is extremely close to that of the optimal policy for the original IDP.
Comment: Appears in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI 2012)
Sample Efficient Feature Selection for Factored MDPs
In reinforcement learning, the state of the real world is often represented
by feature vectors. However, not all of the features may be pertinent for
solving the current task. We propose Feature Selection Explore and Exploit
(FS-EE), an algorithm that automatically selects the necessary features while
learning a factored Markov decision process (MDP), and prove that under mild assumptions its sample complexity scales with the in-degree of the dynamics of just the necessary features, rather than the in-degree of all features. This can result in a much better sample complexity when the in-degree of the necessary features is smaller than the in-degree of all features.
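To illustrate what "in-degree" means here, below is a toy factored transition model in which each state feature's next value depends only on a small parent set; the parent structure and dynamics are invented purely for illustration and are not from the paper.

```python
import numpy as np

parents = {            # feature -> indices of its parent features
    0: [0],            # in-degree 1
    1: [0, 1],         # in-degree 2
    2: [2],            # in-degree 1 (imagine this feature is irrelevant to the task)
}

def step(state, action, rng):
    """Sample the next state feature-by-feature, using only each feature's parents."""
    next_state = np.empty_like(state)
    for f, pa in parents.items():
        # Toy conditional: parity of the parents plus the action, flipped with prob. 0.1.
        val = (state[pa].sum() + action) % 2
        next_state[f] = val if rng.random() > 0.1 else 1 - val
    return next_state

rng = np.random.default_rng(0)
print(step(np.array([1, 0, 1]), action=1, rng=rng))
```

The guarantee described in the abstract says, roughly, that only the in-degrees of the features needed for the task enter the sample complexity bound.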
Policy Gradient Methods for Reinforcement Learning with Function Approximation and Action-Dependent Baselines
We show how an action-dependent baseline can be used with the policy gradient theorem under function approximation, which was originally presented with action-independent baselines by Sutton et al. (2000).
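For reference, the standard policy gradient with a state-dependent (action-independent) baseline, which this work generalizes, is (standard notation, not taken from the paper):

$$\nabla_\theta J(\theta) \;=\; \mathbb{E}_{s \sim d^{\pi_\theta},\; a \sim \pi_\theta(\cdot \mid s)}\Bigl[\, \nabla_\theta \log \pi_\theta(a \mid s)\,\bigl(Q^{\pi_\theta}(s, a) - b(s)\bigr) \Bigr],$$

where subtracting $b(s)$ leaves the gradient unbiased because $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\bigl[\nabla_\theta \log \pi_\theta(a \mid s)\bigr] = 0$; the question addressed here is when the baseline may also depend on the action, $b(s, a)$.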
Learning When-to-Treat Policies
Many applied decision-making problems have a dynamic component: The
policymaker needs not only to choose whom to treat, but also when to start
which treatment. For example, a medical doctor may choose between postponing
treatment (watchful waiting) and prescribing one of several available
treatments during the many visits from a patient. We develop an "advantage
doubly robust" estimator for learning such dynamic treatment rules using
observational data under the assumption of sequential ignorability. We prove
welfare regret bounds that generalize results for doubly robust learning in the
single-step setting, and show promising empirical performance in several
different contexts. Our approach is practical for policy optimization, and does
not need any structural (e.g., Markovian) assumptions
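As background, the single-step doubly robust (AIPW) value estimate that such results generalize has the standard form (our notation, not the paper's):

$$\hat V(\pi) \;=\; \frac{1}{n}\sum_{i=1}^{n}\left[\hat\mu\bigl(x_i, \pi(x_i)\bigr) + \frac{\mathbf{1}\{a_i = \pi(x_i)\}}{\hat e_{a_i}(x_i)}\bigl(y_i - \hat\mu(x_i, a_i)\bigr)\right],$$

where $\hat\mu$ is an outcome-regression model, $\hat e_a(x)$ is the estimated propensity of treatment $a$ given covariates $x$, and the estimate remains consistent if either nuisance model is correctly specified.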
Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning
Statistical performance bounds for reinforcement learning (RL) algorithms can
be critical for high-stakes applications like healthcare. This paper introduces
a new framework for theoretically measuring the performance of such algorithms
called Uniform-PAC, which is a strengthening of the classical Probably
Approximately Correct (PAC) framework. In contrast to the PAC framework, the
uniform version may be used to derive high probability regret guarantees and so
forms a bridge between the two setups that has been missing in the literature.
We demonstrate the benefits of the new framework for finite-state episodic MDPs
with a new algorithm that is Uniform-PAC and simultaneously achieves optimal regret and PAC guarantees except for a factor of the horizon.
Comment: appears in Neural Information Processing Systems 201
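A paraphrase of the Uniform-PAC requirement (our notation; see the paper for the precise statement): a single high-probability event must control the number of suboptimal episodes at every accuracy level simultaneously,

$$\Pr\!\left(\forall\, \varepsilon > 0:\;\; \bigl|\{\, k : \Delta_k > \varepsilon \,\}\bigr| \;\le\; F\!\bigl(\tfrac{1}{\varepsilon},\, \ln\tfrac{1}{\delta},\, |S|,\, |A|,\, H\bigr)\right) \;\ge\; 1 - \delta,$$

where $\Delta_k$ is the suboptimality of the policy played in episode $k$ and $F$ is polynomial in its arguments. Ordinary PAC fixes $\varepsilon$ in advance; requiring the bound for all $\varepsilon$ at once is what allows regret guarantees to follow.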